Optical character recognition (OCR) systems transform pixelized images of documents into ASCII-coded text, which facilitates searching, substitution, and reformatting of documents in a computer system. One use of OCR functionality is to convert handwritten and typewritten documents, books, medical journals, etc. into documents that are searchable on, for example, the Internet or an Intranet. Generally, the quality of information retrieval and document searching is considerably enhanced if all documents are electronically retrievable and searchable. For example, a company Intranet system can link together all old and new documents of an enterprise through extensive use of OCR functionality implemented as part of the Intranet (or as part of the Internet if the documents are of public interest).
However, the quality of OCR functionality is limited by the complexity of the OCR system itself. It is difficult to provide OCR functionality that can solve every problem encountered when converting images of text into computer-coded text. One such problem arises from crossed-out text, or text overlaid by other objects, which is often encountered in documents. For example, a stamp with the text “COPY” may be applied to a page of a document to signify that the document is not the original but a copy of it. Sometimes such documents must be certified as correct copies of the original, which is typically done with additional stamps and the signature of a person entrusted to certify such copies.
The common effect of crossed-out text or other objects overlaying characters is that characters in words are hidden, fully or partly, by the overlaying objects, for example the stamp or the handwritten signature described above. This makes correct identification of the characters, and of the words comprising them, difficult for an OCR system. Usually, OCR systems provide output data comprising a list of uncertainly recognized characters. Crossed-out or overlaid characters will therefore be identifiable as uncertain, and the OCR system may report their position on the text page and within words, together with possible alternative interpretations of each hidden or partly hidden character.
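As an illustration only, the kind of output data described above could be represented as in the following minimal Python sketch. All names here (`RecognizedChar`, `UncertainChar`, `report_uncertain`) and the confidence threshold of 0.8 are hypothetical and chosen for this example. The text does not specify how an OCR system scores characters or decides that a character is uncertain, so a simple per-character confidence score is assumed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RecognizedChar:
    """One character as returned by a hypothetical OCR engine."""
    best: str                       # most likely interpretation
    confidence: float               # assumed score in [0, 1]
    # Other candidate interpretations with their assumed scores.
    alternatives: List[Tuple[str, float]] = field(default_factory=list)


@dataclass
class UncertainChar:
    """One entry in the list of uncertainly recognized characters."""
    page: int                       # page on which the character occurs
    word_index: int                 # position of the word on the page
    char_index: int                 # position of the character in the word
    best: str
    alternatives: List[Tuple[str, float]]


def report_uncertain(page: int,
                     words: List[List[RecognizedChar]],
                     threshold: float = 0.8) -> List[UncertainChar]:
    """Collect every character whose confidence is below the threshold."""
    report = []
    for w_idx, word in enumerate(words):
        for c_idx, ch in enumerate(word):
            if ch.confidence < threshold:
                report.append(
                    UncertainChar(page, w_idx, c_idx, ch.best, ch.alternatives)
                )
    return report


# Example: the word "COPY" where the second character is partly hidden,
# for instance by a signature drawn across it (scores are made up).
word = [
    RecognizedChar("C", 0.98),
    RecognizedChar("O", 0.41, [("Q", 0.33), ("D", 0.20)]),
    RecognizedChar("P", 0.95),
    RecognizedChar("Y", 0.97),
]
report = report_uncertain(page=1, words=[word])
```

In this sketch the report contains a single entry that locates the partly hidden character by page, word, and character position and carries its alternative interpretations, which is the information a later correction step would need.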